Congresso Brasileiro de Microbiologia 2023 | Abstract: 53-2
Abstract: Whole-genome sequencing has substantially improved the discovery of genetic markers and revolutionized the study of population evolution. Currently, there are two main types of sequencing: short reads, e.g. Illumina, and long reads, e.g. Oxford Nanopore Technologies. In 2010, an outbreak of diarrhea caused by Escherichia coli serotype O3:H2 occurred in Brazil. Understanding the virulence factors of pathogenic isolates, as well as the acquisition and transfer of genes, is essential to elucidate evolutionary mechanisms and to develop prevention and control strategies. The objective of this study was to compare genome assembly strategies based on different assemblers and to evaluate which ones generated circularized chromosomes and plasmids closest to the expected genetic composition. The genomes of the eight pathogenic E. coli isolates obtained from the diarrheal outbreak were sequenced with short reads (Illumina) and long reads (Nanopore). Different assemblers were used: SPAdes for short-read-only assembly and Flye for long-read-only assembly, in addition to hybrid approaches that used both short and long reads (Unicycler and Pilon). A polishing step was then performed with Medaka, using the long reads, and with Polypolish, using the short reads. Gene annotation was carried out with Prokka, and Roary was used to define the pan and core genome of the assembled genomes, describing the pattern of gene presence and absence in each of them. In some cases, reads were mapped with Bowtie2 to validate the nucleotide composition of regions of interest. The chromosome and plasmid of some isolates could be assembled as circular molecules using Flye and the hybrid assemblies (Unicycler and Pilon); with Illumina data alone this was not possible. Furthermore, the number of genes differed between assemblies: while the E. coli species has about 4723 genes, the average number of genes detected was 4572.3 in the assemblies built from Illumina reads only and 7792.5 in the assemblies built from Nanopore reads only, with the hybrid assemblies yielding values between these two. The pan and core genome analysis showed that regions with high nucleotide similarity nevertheless differed in predicted gene presence. To evaluate this further, we mapped the short reads to these regions and found that the assembler had inserted a "point mutation" that generated a stop codon, which would explain why the gene was not detected. After polishing, the previously detected "point mutations" were corrected. When the genomes were annotated, some heterogeneity was detected between different assemblies even though their genomic regions were extremely similar. Indeed, a gene related to bacterial virulence was annotated as present in some genomes and absent in others. We conclude that different sequencing types and assembly strategies generate highly heterogeneous results, including differences in the number and diversity of genes detected in the samples. Failure to detect genes can lead to underestimation of the pathogenic potential of an isolate and to erroneous antimicrobial treatment. Our data reinforce the need to use both short and long reads, as well as polishing, to minimize possible errors and generate a more reliable assembly.
Keywords: hybrid assembly, genome assembly, Escherichia coli, virulence
Funding agency: São Paulo Research Foundation (FAPESP)
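
The stop-codon artifact described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the sequences and the function names (`first_premature_stop`, `mismatch_positions`) are hypothetical, chosen only to show how a single-base assembly error can truncate an open reading frame so that an annotation tool no longer reports the full gene.

```python
# Illustrative sketch only: toy sequences, not data from the study.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds):
    """Return the 0-based index of the first in-frame stop codon that
    occurs before the final codon, or None if there is none."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    for idx, codon in enumerate(codons[:-1]):  # last codon is the expected stop
        if codon in STOP_CODONS:
            return idx
    return None


def mismatch_positions(ref, alt):
    """Return 0-based positions where two equal-length sequences differ."""
    return [i for i, (a, b) in enumerate(zip(ref, alt)) if a != b]


# Hypothetical coding sequence as supported by the short reads.
reference = "ATGAAACAGGGTTAA"
# Same region in a hypothetical unpolished assembly: C->T turns CAG into TAG.
assembled = "ATGAAATAGGGTTAA"

print(mismatch_positions(reference, assembled))
print(first_premature_stop(reference))
print(first_premature_stop(assembled))
```

In this toy case the two sequences differ at a single base, yet only the second one contains an internal stop codon, which is the kind of discrepancy that mapping short reads back to the region, followed by polishing, is meant to expose and correct.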